Analysis of LinkedIn Jobs and Skills (2024)¶
Vikram Rangarajan, Ved Karamsetty
Vikram Rangarajan: Contributed to B-H (95%)
Ved Karamsetty: Contributed to A, B, C, E, F, G, H (75%)
Specific Contributions:
Vikram: Did most of the coding, since Ved's PC was unable to handle it (this dataset is MASSIVE) and Colab did not allow GPU use for extended periods of time (required for the ML section).
Ved: Chose dataset, gave inputs on things like how to preprocess data, what hypothesis tests to do, ML techniques, visualization, etc.
We did most of the code together in a call and screen share.
We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial.
- Vikram Rangarajan, Ved Karamsetty, 5/6/24
1. Introduction¶
The topic of our project is the analysis of LinkedIn jobs during 2024. The goal is to find insights on job locations, key skills required, and industry statistics. As college students, finding a job will be the next stage of our careers, and these insights could give us an upper hand when job searching. Specifically, for Computer Science students like us, we can see job trends relating to our skillset.
Required Python libraries:
- Pandas
- pycountry
- Jupyter for the notebook and IPython.display.display
- plotly
- fastparquet (optional, if you are using parquets)
- statsmodels for hypothesis testing
- scipy
- nbformat>=4.2.0 for plotly's mimetype rendering backend
2. Data Curation¶
We will be analyzing the 1.3M LinkedIn Jobs and Skills (2024) dataset on Kaggle. This dataset was web-scraped from LinkedIn itself, giving us accurate and up-to-date data to work with.
This dataset consists of 3 tables:
- job_skills.csv
- This contains the expected job skills for every job posting.
- job_link: Foreign key, str
- Link to Linkedin posting
- job_skills: str
- Comma separated string values. Ex: Customer Service, Communication, Typing
- job_summary.csv
- This contains the summary (job description) for each job posting, alongside the link to the posting, making 2 total columns.
- job_link: Foreign key, str
- Link to Linkedin posting
- job_summary: str
- Job summary, plain text.
- linkedin_job_postings.csv
- This is the most important table. It contains several columns:
- job_link: Foreign key, str
- Link to Linkedin posting
- last_processed_time: str (to be converted to datetime)
- Timestamp indicating the last time the job posting was processed
- got_summary: str (to be converted to bool)
- Indicates whether the job summary was successfully extracted or not
- got_ner: str (to be converted to bool)
- Indicates whether Named Entity Recognition (NER) was performed on the job posting or not
- is_being_worked: str (to be converted to bool)
- Indicates if the job posting is currently being worked on or not
- job_title: str
- Title of the job listing
- company: str
- Company name offering the job position
- job_location: str
- Location of the job position. This is a comma separated string, coming in many formats
- job_type: str
- Either remote, onsite, or hybrid.
- job_level: str
- Either Associate or Mid senior.
- first_seen: str (to be converted to datetime)
- Timestamp indicating when the job posting was first seen
- search_city: str
- City used as a search criterion for collecting the job postings
A problem we faced was that these tables, especially job_summary.csv, were extremely large (over 5 GB), taking up a lot of disk space and a long time to load. To alleviate this, we converted the 3 CSVs to Parquet files, which reduces both disk usage and loading times. Note that this step is optional; we only did it to minimize loading times while debugging. We have provided a script to do this conversion.
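The conversion itself is a one-liner per table. A minimal sketch of what such a script could look like (the helper names `parquet_name` and `csv_to_parquet` are ours, not part of the project's actual script; it assumes fastparquet or pyarrow is installed):

```python
import os
import pandas as pd

def parquet_name(csv_path: str) -> str:
    """Map e.g. 'job_skills.csv' -> 'job_skills.parquet'."""
    return os.path.splitext(csv_path)[0] + ".parquet"

def csv_to_parquet(csv_path: str) -> str:
    """One-time conversion; skips files already converted."""
    out = parquet_name(csv_path)
    if not os.path.exists(out):
        pd.read_csv(csv_path).to_parquet(out)  # needs fastparquet or pyarrow
    return out

# Convert each of the three LinkedIn tables, if present on disk
for name in ["linkedin_job_postings.csv", "job_skills.csv", "job_summary.csv"]:
    if os.path.isfile(name):
        csv_to_parquet(name)
```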
Unfortunately, geospatial (latitude & longitude) data is not included with the job postings, so we use another dataset to join each job posting with the correct latitude and longitude for geospatial plots. We chose the Countries States Cities Database repository: it is popular (starred 6.3k times), open source, and contains exactly what we need, the cities.csv file.
- cities.csv
- Contains city data for many cities worldwide.
- id: int
- Unused
- name: str
- Name of city
- state_id: int
- Unused
- state_code: str
- Unused
- country_id: int
- Unused
- country_code: str (2 letters)
- The ISO 3166-1 alpha-2 code of the country the city is in
- country_name: str
- The name of the country which the city is in
- latitude: float
- Latitude of city
- longitude: float
- Longitude of city
- wikiDataId: str
- Unused
3. Exploratory Data Analysis (EDA)¶
3a. Data Preprocessing¶
We need to do a lot of preprocessing on this dataset: converting data types, joining in data from other tables, and a great deal of string processing.
We will start off with our imports and preliminary settings.
import pandas as pd
import pycountry
from IPython.display import display
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
from collections import Counter
import os
from statsmodels.stats.proportion import proportions_ztest
from scipy import stats
import itertools
from sklearn import cluster, metrics, feature_extraction, decomposition
from wordcloud import WordCloud
from sentence_transformers import SentenceTransformer
# This allows for interactive maps when using jupyter nbconvert to create this HTML page
pio.renderers.default = "plotly_mimetype+notebook"
pd.options.plotting.backend = "plotly"
Now we will load linkedin_job_postings.csv.
if os.path.isfile("linkedin_job_postings.parquet"):
    postings = pd.read_parquet("linkedin_job_postings.parquet")
else:
    postings = pd.read_csv("linkedin_job_postings.csv")
display(postings)
| job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-21 07:12:29.00256+00 | t | t | f | Account Executive - Dispensing (NorCal/Norther... | BD | San Diego, CA | 2024-01-15 | Coronado | United States | Color Maker | Mid senior | Onsite |
| 1 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 07:39:58.88137+00 | t | t | f | Registered Nurse - RN Care Manager | Trinity Health MI | Norton Shores, MI | 2024-01-14 | Grand Haven | United States | Director Nursing Service | Mid senior | Onsite |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | 2024-01-21 07:40:00.251126+00 | t | t | f | RESTAURANT SUPERVISOR - THE FORKLIFT | Wasatch Adaptive Sports | Sandy, UT | 2024-01-14 | Tooele | United States | Stand-In | Mid senior | Onsite |
| 3 | https://www.linkedin.com/jobs/view/independent... | 2024-01-21 07:40:00.308133+00 | t | t | f | Independent Real Estate Agent | Howard Hanna | Rand Realty | Englewood Cliffs, NJ | 2024-01-16 | Pinehurst | United States | Real-Estate Clerk | Mid senior | Onsite |
| 4 | https://www.linkedin.com/jobs/view/group-unit-... | 2024-01-19 09:45:09.215838+00 | f | f | f | Group/Unit Supervisor (Systems Support Manager... | IRS, Office of Chief Counsel | Chamblee, GA | 2024-01-17 | Gadsden | United States | Supervisor Travel-Information Center | Mid senior | Onsite |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1348449 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-20 15:21:07.786118+00 | t | t | f | Registered Nurse (RN) #CP-RN-7998660 - 2411627... | TravelNurseSource | Providence, RI | 2024-01-14 | Fall River | United States | Nurse Supervisor | Mid senior | Onsite |
| 1348450 | https://www.linkedin.com/jobs/view/constructio... | 2024-01-20 15:21:10.885264+00 | t | t | f | Construction Superintendent | Jobot | New Iberia, LA | 2024-01-15 | Lafayette | United States | Assistant Construction Superintendent | Mid senior | Onsite |
| 1348451 | https://www.linkedin.com/jobs/view/executive-c... | 2024-01-21 07:40:00.304641+00 | t | t | f | Executive Chef, Operations Support | NEXDINE Hospitality | Riverhead, NY | 2024-01-14 | Eastport | United States | Chef | Mid senior | Onsite |
| 1348452 | https://www.linkedin.com/jobs/view/rn-register... | 2024-01-21 00:38:39.816821+00 | t | t | f | RN- Registered Nurse, Analyst - - 23934913EXPP... | TravelNurseSource | Aurora, CO | 2024-01-16 | Colorado | United States | Occupational Analyst | Mid senior | Onsite |
| 1348453 | https://www.linkedin.com/jobs/view/on-demand-g... | 2024-01-21 00:38:44.231492+00 | t | t | f | On-Demand: Guest Advocate (Cashier), General M... | Target | Culver City, CA | 2024-01-12 | Malibu | United States | Cashier Ii | Mid senior | Onsite |
1348454 rows × 14 columns
Now we load job_skills.
if os.path.isfile("job_skills.parquet"):
    skills = pd.read_parquet("job_skills.parquet")
else:
    skills = pd.read_csv("job_skills.csv")
display(skills)
| job_link | job_skills | |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/housekeeper... | Building Custodial Services, Cleaning, Janitor... |
| 1 | https://www.linkedin.com/jobs/view/assistant-g... | Customer service, Restaurant management, Food ... |
| 2 | https://www.linkedin.com/jobs/view/school-base... | Applied Behavior Analysis (ABA), Data analysis... |
| 3 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Engineering, Project Controls, Sche... |
| 4 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Assembly, Point to point wiring, St... |
| ... | ... | ... |
| 1296376 | https://www.linkedin.com/jobs/view/community-a... | Communication Skills, Time Management, Custome... |
| 1296377 | https://www.linkedin.com/jobs/view/sr-it-analy... | Windows SQL, EDI X12, Edifecs Platform, Health... |
| 1296378 | https://www.linkedin.com/jobs/view/operations-... | Adaptability, Communication, Digital Fluency, ... |
| 1296379 | https://www.linkedin.com/jobs/view/float-patie... | CNA, EMT, BLS, Medical Assistant, CPCT, LPN, R... |
| 1296380 | https://www.linkedin.com/jobs/view/conductor-e... | Customer Service, Driving, Loading, Unloading,... |
1296381 rows × 2 columns
Same for job_summary.csv
if os.path.isfile("job_summary.parquet"):
    summary = pd.read_parquet("job_summary.parquet")
else:
    summary = pd.read_csv("job_summary.csv")
display(summary)
| job_link | job_summary | |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/restaurant-... | Rock N Roll Sushi is hiring a Restaurant Manag... |
| 1 | https://www.linkedin.com/jobs/view/med-surg-re... | Schedule\n: PRN is required minimum 12 hours p... |
| 2 | https://www.linkedin.com/jobs/view/registered-... | Description\nIntroduction\nAre you looking for... |
| 3 | https://uk.linkedin.com/jobs/view/commercial-a... | Commercial account executive\nSheffield\nFull ... |
| 4 | https://www.linkedin.com/jobs/view/store-manag... | Address:\nUSA-CT-Newington-44 Fenn Road\nStore... |
| ... | ... | ... |
| 1297327 | https://www.linkedin.com/jobs/view/roofing-sup... | We are currently seeking experienced commercia... |
| 1297328 | https://www.linkedin.com/jobs/view/service-cen... | Overview\nStable and growing organization\nCom... |
| 1297329 | https://www.linkedin.com/jobs/view/flight-qual... | Rôle et responsabilités\nJob Description:\nFli... |
| 1297330 | https://www.linkedin.com/jobs/view/global-sour... | Job Description\nAre You Ready to Make It Happ... |
| 1297331 | https://www.linkedin.com/jobs/view/research-as... | Current Employees:\nIf you are a current Staff... |
1297332 rows × 2 columns
We do the same for cities.csv. Here, we also remove periods and accented characters, along with some other cleaning.
# https://github.com/dr5hn/countries-states-cities-database
cities = pd.read_csv("cities.csv")[
    ["name", "state_name", "country_code", "country_name", "latitude", "longitude"]
]
# Deletes periods and replaces accented characters with plain ASCII
cities["name"] = cities["name"].str.replace(".", "", regex=False)
cities["name"] = (
    cities["name"]
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
)
print("Rows before dropping nulls and duplicates", cities.shape[0])
cities.drop_duplicates(["name", "state_name", "country_code"], inplace=True)
cities.dropna(inplace=True)
cities.reset_index(inplace=True, drop=True)
print("Rows after dropping nulls and duplicates", cities.shape[0])
print("Counts:")
print(cities.count())
display(cities[["name", "state_name", "country_code", "country_name"]].describe())
display(cities[["latitude", "longitude"]].describe())
display(cities)
Rows before dropping nulls and duplicates 150574
Rows after dropping nulls and duplicates 150497
Counts:
name            150497
state_name      150497
country_code    150497
country_name    150497
latitude        150497
longitude       150497
dtype: int64
| name | state_name | country_code | country_name | |
|---|---|---|---|---|
| count | 150497 | 150497 | 150497 | 150497 |
| unique | 131214 | 3487 | 197 | 197 |
| top | Merkez | England | US | United States |
| freq | 51 | 2918 | 19820 | 19820 |
| latitude | longitude | |
|---|---|---|
| count | 150497.000000 | 150497.000000 |
| mean | 31.565417 | 2.328172 |
| std | 22.795936 | 68.026230 |
| min | -75.000000 | -179.121980 |
| 25% | 18.993450 | -58.963770 |
| 50% | 40.680660 | 8.657070 |
| 75% | 47.238680 | 27.732170 |
| max | 73.508190 | 179.466000 |
| name | state_name | country_code | country_name | latitude | longitude | |
|---|---|---|---|---|---|---|
| 0 | Ashkasham | Badakhshan | AF | Afghanistan | 36.68333 | 71.53333 |
| 1 | Fayzabad | Badakhshan | AF | Afghanistan | 37.11664 | 70.58002 |
| 2 | Jurm | Badakhshan | AF | Afghanistan | 36.86477 | 70.83421 |
| 3 | Khandud | Badakhshan | AF | Afghanistan | 36.95127 | 72.31800 |
| 4 | Raghistan | Badakhshan | AF | Afghanistan | 37.66079 | 70.67346 |
| ... | ... | ... | ... | ... | ... | ... |
| 150492 | Redcliff | Midlands Province | ZW | Zimbabwe | -19.03333 | 29.78333 |
| 150493 | Shangani | Midlands Province | ZW | Zimbabwe | -19.78333 | 29.36667 |
| 150494 | Shurugwi | Midlands Province | ZW | Zimbabwe | -19.67016 | 30.00589 |
| 150495 | Shurugwi District | Midlands Province | ZW | Zimbabwe | -19.75000 | 30.16667 |
| 150496 | Zvishavane District | Midlands Province | ZW | Zimbabwe | -20.30345 | 30.07514 |
150497 rows × 6 columns
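As a quick illustration of that normalization step, here is the same chain of string operations on a few toy city names (these examples are ours, not rows from the dataset): NFKD decomposes each accented character into a base letter plus a combining mark, and the ASCII round-trip drops the marks.

```python
import pandas as pd

# Toy demo of the cleaning above: strip literal periods, then NFKD-normalize
# and round-trip through ASCII to drop accents/combining marks.
s = pd.Series(["São Paulo", "Zürich", "St. Louis"])
s = s.str.replace(".", "", regex=False)
s = s.str.normalize("NFKD").str.encode("ascii", errors="ignore").str.decode("utf-8")
print(list(s))  # -> ['Sao Paulo', 'Zurich', 'St Louis']
```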
Now we merge all of our LinkedIn dataframes into one dataframe called df and do some more cleaning.
df = postings.merge(skills, how="inner", on="job_link").merge(
    summary, how="inner", on="job_link"
)
print("Rows before dropping nulls and duplicates", df.shape[0])
df.drop_duplicates("job_link", inplace=True)
df.dropna(inplace=True)
df.reset_index(inplace=True, drop=True)
print("Rows after dropping nulls and duplicates", df.shape[0])
# We only want the alphabetical letters
df["job_skills"] = df["job_skills"].str.lower().replace("[^a-z]", " ", regex=True)
# We clean job_location to remove periods and accented characters
df["job_location"] = (
    df["job_location"]
    .str.replace(".", "", regex=False)
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
)
print("Counts:")
print(df.count())
display(df.describe())
display(df)
Rows before dropping nulls and duplicates 1296381
Rows after dropping nulls and duplicates 1294268
Counts:
job_link               1294268
last_processed_time    1294268
got_summary            1294268
got_ner                1294268
is_being_worked        1294268
job_title              1294268
company                1294268
job_location           1294268
first_seen             1294268
search_city            1294268
search_country         1294268
search_position        1294268
job_level              1294268
job_type               1294268
job_skills             1294268
job_summary            1294268
dtype: int64
| job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type | job_skills | job_summary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 | 1294268 |
| unique | 1294268 | 720667 | 1 | 1 | 1 | 564775 | 88932 | 28773 | 6 | 1018 | 4 | 1922 | 2 | 3 | 1285967 | 955934 |
| top | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-19 09:45:09.215838+00 | t | t | f | LEAD SALES ASSOCIATE-FT | Health eCareers | New York, NY | 2024-01-14 | North Carolina | United States | Account Executive | Mid senior | Onsite | front counter drivethru outside order taker ... | Dollar General Corporation has been delivering... |
| freq | 1 | 573435 | 1294268 | 1294268 | 1294268 | 7315 | 40010 | 12561 | 458587 | 9485 | 1103592 | 19463 | 1153364 | 1283454 | 183 | 4565 |
| job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type | job_skills | job_summary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-21 07:12:29.00256+00 | t | t | f | Account Executive - Dispensing (NorCal/Norther... | BD | San Diego, CA | 2024-01-15 | Coronado | United States | Color Maker | Mid senior | Onsite | medical equipment sales key competitors term... | Responsibilities\nJob Description Summary\nJob... |
| 1 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 07:39:58.88137+00 | t | t | f | Registered Nurse - RN Care Manager | Trinity Health MI | Norton Shores, MI | 2024-01-14 | Grand Haven | United States | Director Nursing Service | Mid senior | Onsite | nursing bachelor of science in nursing maste... | Employment Type:\nFull time\nShift:\nDescripti... |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | 2024-01-21 07:40:00.251126+00 | t | t | f | RESTAURANT SUPERVISOR - THE FORKLIFT | Wasatch Adaptive Sports | Sandy, UT | 2024-01-14 | Tooele | United States | Stand-In | Mid senior | Onsite | restaurant operations management inventory ma... | Job Details\nDescription\nWhat You'll Do\nAs a... |
| 3 | https://www.linkedin.com/jobs/view/independent... | 2024-01-21 07:40:00.308133+00 | t | t | f | Independent Real Estate Agent | Howard Hanna | Rand Realty | Englewood Cliffs, NJ | 2024-01-16 | Pinehurst | United States | Real-Estate Clerk | Mid senior | Onsite | real estate customer service sales negotiat... | Who We Are\nRand Realty is a family-owned brok... |
| 4 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 08:08:19.663033+00 | t | t | f | Registered Nurse (RN) | Trinity Health MI | Muskegon, MI | 2024-01-14 | Muskegon | United States | Nurse Practitioner | Mid senior | Onsite | nursing bsn medical license virtual rn nur... | Employment Type:\nFull time\nShift:\n12 Hour N... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1294263 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-20 15:21:07.786118+00 | t | t | f | Registered Nurse (RN) #CP-RN-7998660 - 2411627... | TravelNurseSource | Providence, RI | 2024-01-14 | Fall River | United States | Nurse Supervisor | Mid senior | Onsite | registered nurse bls certification nursing c... | TravelNurseSource is working with Adelphi Medi... |
| 1294264 | https://www.linkedin.com/jobs/view/constructio... | 2024-01-20 15:21:10.885264+00 | t | t | f | Construction Superintendent | Jobot | New Iberia, LA | 2024-01-15 | Lafayette | United States | Assistant Construction Superintendent | Mid senior | Onsite | construction management project planning est... | Want to learn more about this role and Jobot? ... |
| 1294265 | https://www.linkedin.com/jobs/view/executive-c... | 2024-01-21 07:40:00.304641+00 | t | t | f | Executive Chef, Operations Support | NEXDINE Hospitality | Riverhead, NY | 2024-01-14 | Eastport | United States | Chef | Mid senior | Onsite | culinary chef director menu writing cycle o... | NEXDINE HOSPITALITY one of the nation's fastes... |
| 1294266 | https://www.linkedin.com/jobs/view/rn-register... | 2024-01-21 00:38:39.816821+00 | t | t | f | RN- Registered Nurse, Analyst - - 23934913EXPP... | TravelNurseSource | Aurora, CO | 2024-01-16 | Colorado | United States | Occupational Analyst | Mid senior | Onsite | registered nurse analyst rn registered nurse... | TravelNurseSource is working with Alliant Pers... |
| 1294267 | https://www.linkedin.com/jobs/view/on-demand-g... | 2024-01-21 00:38:44.231492+00 | t | t | f | On-Demand: Guest Advocate (Cashier), General M... | Target | Culver City, CA | 2024-01-12 | Malibu | United States | Cashier Ii | Mid senior | Onsite | customer service communication problem solvi... | Description: Starting Hourly Rate / Salario po... |
1294268 rows × 16 columns
del postings, summary, skills # Clear memory
Now, we do datatype manipulation. We have to:
- Convert t & f to python True and False (in the postings table)
- Convert date strings to pd.datetime64 (postings table)
- Turn comma separated strings into lists (skills and postings tables)
- Take job location and pull out the city, state, and country for each posting, and impute values as needed
- Join postings table with cities table to get latitudes and longitudes of every city (for geospatial visualization)
def t_f_to_bool(x):
    """Convert the 't'/'f' strings in this dataset to real booleans."""
    return x == "t"

def split_str(x):
    """Split a comma-separated string into a list; non-strings become []."""
    if isinstance(x, str):
        return x.split(", ")
    return []

job_level = {"Associate": 0, "Mid senior": 1}
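A quick sanity check of those helpers on sample inputs (redefined here so the snippet runs standalone):

```python
def t_f_to_bool(x):
    return x == "t"

def split_str(x):
    if isinstance(x, str):
        return x.split(", ")
    return []

print(t_f_to_bool("t"), t_f_to_bool("f"))  # -> True False
print(split_str("Customer Service, Communication, Typing"))
# -> ['Customer Service', 'Communication', 'Typing']
print(split_str(None))  # -> []
```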
Postings table manipulation
# Turn t and f into True and False
for col in ["got_summary", "got_ner", "is_being_worked"]:
    df[col] = df[col].apply(t_f_to_bool)
# Convert strings to dates
df["last_processed_time"] = pd.to_datetime(
    df["last_processed_time"],
    format="ISO8601",
)
df["first_seen"] = pd.to_datetime(
    df["first_seen"],
    format="ISO8601",
)
df.head()
| job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type | job_skills | job_summary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-21 07:12:29.002560+00:00 | True | True | False | Account Executive - Dispensing (NorCal/Norther... | BD | San Diego, CA | 2024-01-15 | Coronado | United States | Color Maker | Mid senior | Onsite | medical equipment sales key competitors term... | Responsibilities\nJob Description Summary\nJob... |
| 1 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 07:39:58.881370+00:00 | True | True | False | Registered Nurse - RN Care Manager | Trinity Health MI | Norton Shores, MI | 2024-01-14 | Grand Haven | United States | Director Nursing Service | Mid senior | Onsite | nursing bachelor of science in nursing maste... | Employment Type:\nFull time\nShift:\nDescripti... |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | 2024-01-21 07:40:00.251126+00:00 | True | True | False | RESTAURANT SUPERVISOR - THE FORKLIFT | Wasatch Adaptive Sports | Sandy, UT | 2024-01-14 | Tooele | United States | Stand-In | Mid senior | Onsite | restaurant operations management inventory ma... | Job Details\nDescription\nWhat You'll Do\nAs a... |
| 3 | https://www.linkedin.com/jobs/view/independent... | 2024-01-21 07:40:00.308133+00:00 | True | True | False | Independent Real Estate Agent | Howard Hanna | Rand Realty | Englewood Cliffs, NJ | 2024-01-16 | Pinehurst | United States | Real-Estate Clerk | Mid senior | Onsite | real estate customer service sales negotiat... | Who We Are\nRand Realty is a family-owned brok... |
| 4 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 08:08:19.663033+00:00 | True | True | False | Registered Nurse (RN) | Trinity Health MI | Muskegon, MI | 2024-01-14 | Muskegon | United States | Nurse Practitioner | Mid senior | Onsite | nursing bsn medical license virtual rn nur... | Employment Type:\nFull time\nShift:\n12 Hour N... |
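The format="ISO8601" option (pandas ≥ 2.0) is what lets a single call parse timestamps whose fractional-second precision varies from row to row, as it does in this dataset. A small demo with two timestamp strings taken from the table above:

```python
import pandas as pd

# Mixed fractional-second precision, parsed in one call (pandas >= 2.0)
ts = pd.to_datetime(
    pd.Series(["2024-01-21 07:12:29.00256+00", "2024-01-21 07:40:00.251126+00"]),
    format="ISO8601",
)
print(ts.dt.year.tolist())
```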
Here, we split job_location into a list of strings, as opposed to a comma separated string like it was before.
df["job_loc_list"] = df["job_location"].apply(split_str)
We will eventually drop any job locations that have more than 3 or fewer than 1 elements. There are only 28 of them, which is negligible out of 1.3 million rows.
length_of_cities = df["job_loc_list"].map(len)
num_invalid = df["job_loc_list"][
    (length_of_cities > 3) | (length_of_cities < 1)
].shape[0]
num_invalid, round(num_invalid / df["job_loc_list"].shape[0] * 100, 2)
(28, 0.0)
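A minimal sketch of that length filter on toy data (the location lists here are made up): rows whose list has 1 to 3 elements survive, the rest are dropped.

```python
import pandas as pd

toy = pd.DataFrame(
    {"job_loc_list": [["Austin", "TX"], [], ["A", "B", "C", "D"], ["London"]]}
)
lengths = toy["job_loc_list"].map(len)
# Keep only rows whose location list has 1-3 elements
kept = toy[(lengths >= 1) & (lengths <= 3)]
print(kept.shape[0])  # -> 2 rows survive the filter
```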
These dictionaries support the imputation of missing values: one maps US state codes to full state names (and vice versa), and another maps every U.S. state to its most populous city, used when the city is missing. We got these values from Wikipedia.
us_states = {
"AL": "Alabama",
"AK": "Alaska",
"AZ": "Arizona",
"AR": "Arkansas",
"CA": "California",
"CO": "Colorado",
"CT": "Connecticut",
"DE": "Delaware",
"FL": "Florida",
"GA": "Georgia",
"HI": "Hawaii",
"ID": "Idaho",
"IL": "Illinois",
"IN": "Indiana",
"IA": "Iowa",
"KS": "Kansas",
"KY": "Kentucky",
"LA": "Louisiana",
"ME": "Maine",
"MD": "Maryland",
"MA": "Massachusetts",
"MI": "Michigan",
"MN": "Minnesota",
"MS": "Mississippi",
"MO": "Missouri",
"MT": "Montana",
"NE": "Nebraska",
"NV": "Nevada",
"NH": "New Hampshire",
"NJ": "New Jersey",
"NM": "New Mexico",
"NY": "New York",
"NC": "North Carolina",
"ND": "North Dakota",
"OH": "Ohio",
"OK": "Oklahoma",
"OR": "Oregon",
"PA": "Pennsylvania",
"RI": "Rhode Island",
"SC": "South Carolina",
"SD": "South Dakota",
"TN": "Tennessee",
"TX": "Texas",
"UT": "Utah",
"VT": "Vermont",
"VA": "Virginia",
"WA": "Washington",
"WV": "West Virginia",
"WI": "Wisconsin",
"WY": "Wyoming",
"DC": "District of Columbia",
"AS": "American Samoa",
"GU": "Guam",
"MP": "Northern Mariana Islands",
"PR": "Puerto Rico",
"UM": "United States Minor Outlying Islands",
"VI": "US Virgin Islands",
}
us_states |= {v: k for k, v in us_states.items()}
# https://en.wikipedia.org/wiki/List_of_largest_cities_of_U.S._states_and_territories_by_population
state_cities = {
"Alabama": "Huntsville",
"Alaska": "Anchorage",
"Arizona": "Phoenix",
"Arkansas": "Little Rock",
"California": "Los Angeles",
"Colorado": "Denver",
"Connecticut": "Bridgeport",
"Delaware": "Wilmington",
"Florida": "Jacksonville",
"Georgia": "Atlanta",
"Hawaii": "Honolulu",
"Idaho": "Boise",
"Illinois": "Chicago",
"Indiana": "Indianapolis",
"Iowa": "Des Monies",
"Kansas": "Wichita",
"Kentucky": "Louisville",
"Louisiana": "New Orleans",
"Maine": "Portland",
"Maryland": "Baltimore",
"Massachusetts": "Boston",
"Michigan": "Detroit",
"Minnesota": "Minneapolis",
"Mississippi": "Jackson",
"Missouri": "Kansas City",
"Montana": "Billings",
"Nebraska": "Omaha",
"Nevada": "Las Vegas",
"New Hampshire": "Manchester",
"New Jersey": "Newark",
"New Mexico": "Albuquerque",
"New York": "New York City",
"North Carolina": "Charlotte",
"North Dakota": "Fargo",
"Ohio": "Columbus",
"Oklahoma": "Oklahoma City",
"Oregon": "Portland",
"Pennsylvania": "Philadelphia",
"Rhode Island": "Providence",
"South Carolina": "Charleston",
"South Dakota": "Sioux Falls",
"Tennessee": "Nashville",
"Texas": "Houston",
"Utah": "Salt Lake City",
"Vermont": "Burlington",
"Virginia": "Virginia Beach",
"Washington": "Seattle",
"West Virginia": "Charleston",
"Wisconsin": "Milwaukee",
"Wyoming": "Cheyenne",
"District of Columbia": "Washington DC",
}
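The `|=` line after the first dictionary merges in the reverse mapping, so a single lookup table answers both directions (code → name and name → code). A tiny self-contained demo on a two-state subset:

```python
# Two-way lookup built the same way as us_states above (toy subset)
us_states = {"CA": "California", "NY": "New York"}
us_states |= {v: k for k, v in us_states.items()}  # dict union-assign, Python 3.9+
print(us_states["California"], us_states["NY"])  # -> CA New York
```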
This monster of a code block is responsible for taking the job_location and pulling out the city, state, and country. It's complicated because the dataset stores locations in roughly 20 different formats, and this code tries to account for as many as possible.
It will also print out the invalid locations it encounters, or at least the most frequent ones.
valid_cities = set(
zip(
cities["name"],
cities["state_name"],
cities["country_code"],
)
)
valid_countries = set(cities["country_code"])
us_cities = dict(
zip(
cities[cities["country_code"] == "US"]["name"],
cities[cities["country_code"] == "US"]["state_name"],
)
)
inv = Counter()

def to_columns(loc):
    city = state = country = None
    if len(loc) == 1:
        item = loc[0].replace(" Metropolitan Area", "")
        if item == "United States" or "New York" in item:
            city = "New York City"
            state = "New York"
            country = "US"
        if item not in us_cities:
            item = item.replace("St ", "Saint ")
            for us_city, us_state in us_cities.items():
                if us_city in item:
                    city = us_city
                    state = us_state
                    country = "US"
    if len(loc) == 2:
        """
        Formats:
        - city, state (U.S.)
        - United Kingdom, Great Britain (impute city with London)
        - state, United States
        """
        temp_state = loc[1].replace(" Metropolitan Area", "")
        if temp_state in us_states:
            state = us_states[temp_state]
            city = loc[0].replace(" Metropolitan Area", "")
            country = "US"
            if city == state == "New York":
                city = "New York City"
            if city not in us_cities:
                city = city.replace("St ", "Saint ")
            if city not in us_cities:
                for us_city, us_state in us_cities.items():
                    if us_city in loc[0] and state == us_state:
                        city = us_city
        elif loc[1] == "United States" or loc[1] == "US":
            state = loc[0].replace(" Metropolitan Area", "")
            # Impute with largest city in state
            city = state_cities[state]
            country = "US"
        elif loc[1] == "United Kingdom":
            state = loc[0].replace(" Metropolitan Area", "")
            # Impute with London
            city = "London"
            country = "GB"
        if city == "Washington" and state == "District of Columbia":
            city = "Washington DC"
    elif len(loc) == 3:
        """
        Formats:
        - city, state, country
        """
        # Full country name to alpha-2 code. Ex: United States -> US
        country_obj = pycountry.countries.get(name=loc[2])
        if country_obj is not None:
            country = country_obj.alpha_2
        elif loc[2] == "Taiwan":
            country = "TW"
        elif loc[2] == "Turkey":
            country = "TR"
        elif loc[2] == "Tanzania":
            country = "TZ"
        city = loc[0].replace(" Metropolitan Area", "")
        # Code to full name. Ex: AL -> Alabama
        if country == "US" and loc[1] in us_states and len(loc[1]) == 2:
            state = us_states[loc[1]]
        else:
            state = loc[1].replace(" Metropolitan Area", "")
    if (city, state, country) not in valid_cities:
        inv[", ".join(loc) + f", Split: {city}, {state}, {country}"] += 1
    return [city, state, country]
df[["city", "state", "country"]] = pd.DataFrame(
list(df["job_loc_list"].map(to_columns))
)
# Prints the occurences of the invalid cities (ones that aren't in the cities dataset)
# This is either because of formatting errors, or it's a small city that's not in the dataset
len(inv), sorted(inv.items(), key=lambda x: -x[1])[:10]
(11173,
[('West Midlands, England, United Kingdom, Split: West Midlands, England, GB',
806),
('West Yorkshire, England, United Kingdom, Split: West Yorkshire, England, GB',
411),
('Austin, Texas Metropolitan Area, Split: Austin, TX, US', 356),
('Charleston, South Carolina Metropolitan Area, Split: Charleston, SC, US',
354),
('Brick, NJ, Split: Brick, New Jersey, US', 351),
('Little London, England, United Kingdom, Split: Little London, England, GB',
346),
('University Park, PA, Split: University Park, Pennsylvania, US', 329),
('Hull, England, United Kingdom, Split: Hull, England, GB', 327),
('West Chester, OH, Split: West Chester, Ohio, US', 320),
('Scotland, United Kingdom, Split: London, Scotland, GB', 318)])
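The sorted(...)[:10] expression above can equivalently be written with Counter.most_common, which sorts by count descending for you. A small demo with made-up counts:

```python
from collections import Counter

# Toy counter standing in for the inv counter above
inv_demo = Counter({"A, B, C": 5, "X, Y": 2, "Q": 9})
top = inv_demo.most_common(2)  # highest counts first
print(top)  # -> [('Q', 9), ('A, B, C', 5)]
```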
del valid_cities # Clear memory
Now we finally merge our combined dataframe with the cities dataset. We do a left join so that we can see which rows failed to match (by checking for null values).
df = df.merge(
    cities,
    how="left",
    left_on=["city", "state", "country"],
    right_on=["name", "state_name", "country_code"],
)
df
| job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | ... | job_loc_list | city | state | country | name | state_name | country_code | country_name | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-21 07:12:29.002560+00:00 | True | True | False | Account Executive - Dispensing (NorCal/Norther... | BD | San Diego, CA | 2024-01-15 | Coronado | ... | [San Diego, CA] | San Diego | California | US | San Diego | California | US | United States | 32.71571 | -117.16472 |
| 1 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 07:39:58.881370+00:00 | True | True | False | Registered Nurse - RN Care Manager | Trinity Health MI | Norton Shores, MI | 2024-01-14 | Grand Haven | ... | [Norton Shores, MI] | Norton Shores | Michigan | US | Norton Shores | Michigan | US | United States | 43.16890 | -86.26395 |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | 2024-01-21 07:40:00.251126+00:00 | True | True | False | RESTAURANT SUPERVISOR - THE FORKLIFT | Wasatch Adaptive Sports | Sandy, UT | 2024-01-14 | Tooele | ... | [Sandy, UT] | Sandy | Utah | US | Sandy | Utah | US | United States | 40.59161 | -111.88410 |
| 3 | https://www.linkedin.com/jobs/view/independent... | 2024-01-21 07:40:00.308133+00:00 | True | True | False | Independent Real Estate Agent | Howard Hanna | Rand Realty | Englewood Cliffs, NJ | 2024-01-16 | Pinehurst | ... | [Englewood Cliffs, NJ] | Englewood Cliffs | New Jersey | US | Englewood Cliffs | New Jersey | US | United States | 40.88538 | -73.95236 |
| 4 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 08:08:19.663033+00:00 | True | True | False | Registered Nurse (RN) | Trinity Health MI | Muskegon, MI | 2024-01-14 | Muskegon | ... | [Muskegon, MI] | Muskegon | Michigan | US | Muskegon | Michigan | US | United States | 43.23418 | -86.24839 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1294263 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-20 15:21:07.786118+00:00 | True | True | False | Registered Nurse (RN) #CP-RN-7998660 - 2411627... | TravelNurseSource | Providence, RI | 2024-01-14 | Fall River | ... | [Providence, RI] | Providence | Rhode Island | US | Providence | Rhode Island | US | United States | 41.82399 | -71.41283 |
| 1294264 | https://www.linkedin.com/jobs/view/constructio... | 2024-01-20 15:21:10.885264+00:00 | True | True | False | Construction Superintendent | Jobot | New Iberia, LA | 2024-01-15 | Lafayette | ... | [New Iberia, LA] | New Iberia | Louisiana | US | New Iberia | Louisiana | US | United States | 30.00354 | -91.81873 |
| 1294265 | https://www.linkedin.com/jobs/view/executive-c... | 2024-01-21 07:40:00.304641+00:00 | True | True | False | Executive Chef, Operations Support | NEXDINE Hospitality | Riverhead, NY | 2024-01-14 | Eastport | ... | [Riverhead, NY] | Riverhead | New York | US | Riverhead | New York | US | United States | 40.91704 | -72.66204 |
| 1294266 | https://www.linkedin.com/jobs/view/rn-register... | 2024-01-21 00:38:39.816821+00:00 | True | True | False | RN- Registered Nurse, Analyst - - 23934913EXPP... | TravelNurseSource | Aurora, CO | 2024-01-16 | Colorado | ... | [Aurora, CO] | Aurora | Colorado | US | Aurora | Colorado | US | United States | 39.72943 | -104.83192 |
| 1294267 | https://www.linkedin.com/jobs/view/on-demand-g... | 2024-01-21 00:38:44.231492+00:00 | True | True | False | On-Demand: Guest Advocate (Cashier), General M... | Target | Culver City, CA | 2024-01-12 | Malibu | ... | [Culver City, CA] | Culver City | California | US | Culver City | California | US | United States | 34.02112 | -118.39647 |
1294268 rows × 26 columns
Below are examples of rows that failed the merge; these are the job_location formats we would still want to handle in the to_columns() function.
# Invalid merge rows
m = df[df["latitude"].isnull()]
print(m["country"].value_counts())
m
country
US    41903
GB    15871
CA     5395
AU     2335
MX      644
FR       29
MY       25
AE       10
IN       10
SA        9
PH        6
AR        5
ES        5
IT        5
BR        4
GR        4
IE        4
BE        3
CN        3
CH        3
HU        3
TR        3
SE        2
DK        2
CM        2
EG        1
NG        1
DE        1
ID        1
TZ        1
JP        1
CZ        1
PL        1
TW        1
CL        1
Name: count, dtype: int64
| job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | ... | job_loc_list | city | state | country | name | state_name | country_code | country_name | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | https://www.linkedin.com/jobs/view/special-age... | 2024-01-21 08:08:21.308995+00:00 | True | True | False | Special Agent: Law/Legal Background | Federal Bureau of Investigation (FBI) | Austin, Texas Metropolitan Area | 2024-01-16 | Austin | ... | [Austin, Texas Metropolitan Area] | Austin | TX | US | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | https://www.linkedin.com/jobs/view/analyst-ass... | 2024-01-21 02:01:15.820041+00:00 | True | True | False | Analyst, Associate Plan Configuration & Qualit... | RITE AID | Yocumtown, PA | 2024-01-13 | Lebanon | ... | [Yocumtown, PA] | Yocumtown | Pennsylvania | US | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | https://uk.linkedin.com/jobs/view/property-man... | 2024-01-21 06:01:13.568960+00:00 | True | True | False | Property Manager | Knight Dragon Ltd | Royal Borough of Greenwich, England, United Ki... | 2024-01-14 | Basildon | ... | [Royal Borough of Greenwich, England, United K... | Royal Borough of Greenwich | England | GB | NaN | NaN | NaN | NaN | NaN | NaN |
| 59 | https://uk.linkedin.com/jobs/view/ct-technolog... | 2024-01-21 06:01:14.166308+00:00 | True | True | False | CT Technologist | Host Healthcare, Inc. | Elsworth, England, United Kingdom | 2024-01-16 | Peterborough | ... | [Elsworth, England, United Kingdom] | Elsworth | England | GB | NaN | NaN | NaN | NaN | NaN | NaN |
| 104 | https://uk.linkedin.com/jobs/view/corporate-re... | 2024-01-21 07:43:54.617770+00:00 | True | True | False | Corporate Real Estate Manager (f/m/div) | Siemens Gamesa | Hull, England, United Kingdom | 2024-01-17 | Kingston upon Hull | ... | [Hull, England, United Kingdom] | Hull | England | GB | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1294185 | https://www.linkedin.com/jobs/view/mascot-at-w... | 2024-01-20 05:13:01.735342+00:00 | True | True | False | Mascot | Westgate Resorts | River Ranch, FL | 2024-01-14 | Florida | ... | [River Ranch, FL] | River Ranch | Florida | US | NaN | NaN | NaN | NaN | NaN | NaN |
| 1294200 | https://www.linkedin.com/jobs/view/fleet-maint... | 2024-01-20 05:13:08.025382+00:00 | True | True | False | Fleet Maintenance Manager - Year Round in Echo... | Energy Jobline | Echo, UT | 2024-01-17 | Bountiful | ... | [Echo, UT] | Echo | Utah | US | NaN | NaN | NaN | NaN | NaN | NaN |
| 1294221 | https://www.linkedin.com/jobs/view/scientist-i... | 2024-01-21 08:28:03.629178+00:00 | True | True | False | Scientist IV - Clearance Preferred | LMI | Fort Detrick, MD | 2024-01-14 | Fairfield | ... | [Fort Detrick, MD] | Fort Detrick | Maryland | US | NaN | NaN | NaN | NaN | NaN | NaN |
| 1294231 | https://ca.linkedin.com/jobs/view/academic-tut... | 2024-01-20 13:46:05.388491+00:00 | True | True | False | Academic tutoring Private Tutoring Jobs Catalone | ClickJobs.io | Catalone, Nova Scotia, Canada | 2024-01-16 | Nova Scotia | ... | [Catalone, Nova Scotia, Canada] | Catalone | Nova Scotia | CA | NaN | NaN | NaN | NaN | NaN | NaN |
| 1294245 | https://www.linkedin.com/jobs/view/ft-center-s... | 2024-01-20 13:46:10.289403+00:00 | True | True | False | FT Center Store Manager (H) | Food Lion | Powhatan, VA | 2024-01-14 | Abingdon | ... | [Powhatan, VA] | Powhatan | Virginia | US | NaN | NaN | NaN | NaN | NaN | NaN |
69709 rows × 26 columns
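An alternative to the null-latitude check is pandas' `indicator=True` flag, which tags every row with its merge provenance. A minimal sketch on toy data (the tables and column names here are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the postings and cities tables
postings = pd.DataFrame(
    {"city": ["Austin", "Yocumtown"], "state": ["Texas", "Pennsylvania"]}
)
cities = pd.DataFrame(
    {"name": ["Austin"], "state_name": ["Texas"], "latitude": [30.27]}
)

merged = postings.merge(
    cities, how="left", left_on="city", right_on="name", indicator=True
)
# Rows tagged "left_only" had no match in the cities table
failed = merged[merged["_merge"] == "left_only"]
print(failed["city"].tolist())  # ['Yocumtown']
```

This avoids depending on a particular right-hand column (like latitude) being non-null.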
df = df.dropna()
df.reset_index(inplace=True, drop=True)
We only lost ~5% of the original data, which isn't terrible, but it could definitely use improvement.
print(
"Percent of original dataset retained:", round(100 * df.shape[0] / 1300000, 3)
)
Percent of original dataset retained: 94.197
3b. Basic Data Exploration and Summary Statistics¶
Now thanks to the cities dataset, we can get a cool choropleth plot which you can interact with via plotly. It shows the number of jobs per country.
map_df = df.groupby(["country", "country_name"]).size()
map_df = pd.DataFrame(map_df).reset_index()
map_df.rename(
{"country": "alpha_2", "country_name": "country", 0: "job count"},
inplace=True,
axis=1,
)
map_df["Log10(job count)"] = np.log10(map_df["job count"])
map_df["alpha_3"] = map_df["alpha_2"].map(
lambda x: pycountry.countries.get(alpha_2=x).alpha_3
)
map_df
fig = px.choropleth(
map_df,
locations="alpha_3",
hover_name="country",
color="Log10(job count)",
hover_data=["job count"],
projection="orthographic",
color_continuous_scale="viridis",
)
fig.update_layout(
margin=dict(l=0, r=0, t=0, b=0, pad=4, autoexpand=True),
width=600,
)
fig.update_coloraxes(colorbar_len=0.6)
fig.show()
Here, we can do the same thing for cities, except with a geospatial scatter plot. Since there's an insanely large number of cities, we limit the plot to cities with > 100 jobs so the map isn't extremely laggy.
latlon = df.groupby(["latitude", "longitude", "city", "state", "country"]).size()
latlon = pd.DataFrame(latlon).reset_index().rename({0: "count"}, axis=1)
fig = px.scatter_geo(
latlon.query("count > 100"),
lat="latitude",
lon="longitude",
size="count",
color="count",
color_continuous_scale="viridis",
hover_data=["city", "state", "country"],
opacity=1,
projection="orthographic",
)
fig.update_layout(
margin=dict(l=0, r=0, t=0, b=0, pad=4, autoexpand=True),
width=600,
)
fig.update_coloraxes(colorbar_len=0.6)
fig.show()
del latlon # Clear memory
top3 = map_df.sort_values("job count", ascending=False).iloc[:3][
["country", "job count"]
]
display(top3)
top3 = set(top3["country"].values)
| country | job count | |
|---|---|---|
| 24 | United States | 1061554 |
| 8 | United Kingdom | 91708 |
| 3 | Canada | 45952 |
Recall that the job_level column takes 2 values: Associate and Mid-Senior. We want to see which countries have a higher proportion of high-level positions. Keep in mind that Mid-Senior == 1 and Associate == 0, so the mean of the column is exactly the proportion of mid-senior positions. To compare countries, we use a 2-sample z test for proportions; since we're comparing 3 countries, we need 3 pairwise tests. ANOVA is not appropriate here, since the underlying variable is binary rather than approximately normally distributed (a proportion is technically the mean of a binomial variable).
We will use $\alpha$=0.05
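The identity behind this setup (the mean of a 0/1 column equals the proportion of 1s) and one pairwise test can be sketched on invented counts:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# The mean of a binary (0/1) column is exactly the proportion of 1s
levels = np.array([1, 1, 0, 1, 0])  # 1 = Mid-Senior, 0 = Associate
print(levels.mean())  # 0.6, i.e. 60% mid-senior

# One 2-sample z test for proportions (the counts are invented)
stat, pval = proportions_ztest(
    count=[900, 880],   # mid-senior counts in the two countries
    nobs=[1000, 1000],  # total postings in the two countries
    alternative="larger",
)
print(stat, pval)
```

With real data, `count` and `nobs` come from the grouped sums and counts computed below.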
num_mid_high = df.query("country_name in @top3").copy()
num_mid_high["job_level"] = num_mid_high["job_level"].apply(
lambda x: job_level[x]
) # Mid-Senior == 1, Associate == 0
num_mid_high = num_mid_high.groupby("country_name")["job_level"].sum().sort_values()
num_jobs = (
df.query("country_name in @top3")
.groupby("country_name")["job_level"]
.count()
.sort_values()
)
props = num_mid_high / num_jobs
fig = props.plot(kind="bar", backend="plotly")
fig.update_layout(
yaxis_range=[0, 1],
title="Sample Proportion of Mid-Senior Jobs by (Top 3) Countries",
)
The differences seen in the bar chart are supported by a $\chi^2$ test on the same data ($p<<\alpha$). We create the contingency table (and a proportion table just for visualization), and carry out the $\chi^2$ test.
contingency_table = pd.crosstab(
df.query("country_name in @top3")["country_name"],
df.query("country_name in @top3")["job_level"],
)
display(contingency_table)
display((contingency_table.T / contingency_table.sum(axis=1)).T)
display(
contingency_table.plot(
kind="bar", title="Frequency", backend="plotly", barmode="group"
)
)
display(
((contingency_table.T / contingency_table.sum(axis=1)).T).plot(
kind="bar", title="Probability distributions", backend="plotly", barmode="group"
)
)
stats.chi2_contingency(contingency_table)
| job_level | Associate | Mid senior |
|---|---|---|
| country_name | ||
| Canada | 5134 | 40818 |
| United Kingdom | 8166 | 83542 |
| United States | 117017 | 944537 |
| job_level | Associate | Mid senior |
|---|---|---|
| country_name | ||
| Canada | 0.111725 | 0.888275 |
| United Kingdom | 0.089043 | 0.910957 |
| United States | 0.110232 | 0.889768 |
Chi2ContingencyResult(statistic=395.8733435193279, pvalue=1.0894208642323106e-86, dof=2, expected_freq=array([[ 4993.54309072, 40958.45690928],
[ 9965.78712056, 81742.21287944],
[115357.66978871, 946196.33021129]]))
Since we rejected the null hypothesis, we follow up with post hoc tests: a 2-sample z test for the difference in the proportion of mid-senior jobs between each pair of countries. These tests state with high confidence that the UK's proportion exceeds both the USA's and Canada's. However, the proportion of mid-senior jobs appears to be about the same for the U.S. and Canada ($p > 0.05$, so we can't reject the null hypothesis that they are equal).
print("USA > Canada")
print(
proportions_ztest(
num_mid_high[["United States", "Canada"]],
num_jobs[["United States", "Canada"]],
alternative="larger",
)
)
print("UK > USA")
print(
proportions_ztest(
num_mid_high[["United Kingdom", "United States"]],
num_jobs[["United Kingdom", "United States"]],
alternative="larger",
)
)
print("UK > Canada")
print(
proportions_ztest(
num_mid_high[["United Kingdom", "Canada"]],
num_jobs[["United Kingdom", "Canada"]],
alternative="larger",
)
)
USA > Canada
(1.0005819489094683, 0.1585144803056488)
UK > USA
(19.790144761015508, 1.8100028997368962e-87)
UK > Canada
(13.432934673059961, 1.9386787397135185e-41)
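One caveat: running three pairwise tests inflates the family-wise error rate. A Bonferroni correction (not applied above; shown here as an optional robustness check) leaves the conclusions unchanged:

```python
from statsmodels.stats.multitest import multipletests

# p-values from the three pairwise z tests above
pvals = [0.1585144803056488, 1.8100028997368962e-87, 1.9386787397135185e-41]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(reject)  # [False  True  True]: same rejections after correction
print(p_adj)   # adjusted p-values (each p multiplied by 3, capped at 1)
```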
Cool. Now we look at the job_type, which is Remote, Onsite, or Hybrid. We also compare these between our top 3 countries. We will do this using another $\chi^2$ test.
contingency_table = pd.crosstab(
df.query("country_name in @top3")["country_name"],
df.query("country_name in @top3")["job_type"],
)
display(contingency_table)
display((contingency_table.T / contingency_table.sum(axis=1)).T)
display(
contingency_table.plot(
kind="bar", title="Frequency", backend="plotly", barmode="group"
)
)
display(
((contingency_table.T / contingency_table.sum(axis=1)).T).plot(
kind="bar", title="Probability distributions", backend="plotly", barmode="group"
)
)
stats.chi2_contingency(contingency_table)
| job_type | Hybrid | Onsite | Remote |
|---|---|---|---|
| country_name | |||
| Canada | 326 | 45515 | 111 |
| United Kingdom | 774 | 90723 | 211 |
| United States | 4909 | 1053014 | 3631 |
| job_type | Hybrid | Onsite | Remote |
|---|---|---|---|
| country_name | |||
| Canada | 0.007094 | 0.990490 | 0.002416 |
| United Kingdom | 0.008440 | 0.989259 | 0.002301 |
| United States | 0.004624 | 0.991955 | 0.003420 |
Chi2ContingencyResult(statistic=330.6899939082778, pvalue=2.585770377761359e-70, dof=4, expected_freq=array([[2.30255457e+02, 4.55702718e+04, 1.51472761e+02],
[4.59528801e+02, 9.09461718e+04, 3.02299443e+02],
[5.31921574e+03, 1.05273556e+06, 3.49922780e+03]]))
$p<<\alpha$, so we reject the null hypothesis. The frequencies of remote, hybrid, and onsite jobs are statistically different across the countries, though the absolute differences are small (all three are roughly 99% onsite).
Now for the next tests, we will look at the companies. In particular, we want to see how large the companies are in each country, as large companies may indicate a strong and stable economy. To do this, we can take a look at the distribution of job counts among the top companies in each country. For example, we look at the distribution of the top 50 companies from the U.S., Canada, and U.K. The company names are not a numerical variable, but if we replace them with their rank (1st largest company, 2nd largest company, ...), then we can get ourselves a probability mass function.
We will then want to compare the 3 probability distributions. To do this, we use the 2-sample Kolmogorov-Smirnov test, which checks whether two samples come from the same distribution, between each pair of the 3 countries.
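The count-to-sample expansion the KS test requires (each company rank repeated once per job) can be done compactly with `np.repeat`; a toy sketch with invented counts:

```python
import numpy as np
from scipy import stats

# counts[i] = number of postings at the i-th largest company (invented numbers)
counts_a = np.array([50, 30, 20])
counts_b = np.array([35, 34, 31])

# Expand each count table into a raw sample of company ranks:
# rank 0 repeated 50 times, rank 1 repeated 30 times, ...
sample_a = np.repeat(np.arange(len(counts_a)), counts_a)
sample_b = np.repeat(np.arange(len(counts_b)), counts_b)

res = stats.ks_2samp(sample_a, sample_b)
print(res.statistic)  # largest gap between the two empirical CDFs (0.15 here)
```

The loop used in the notebook below builds the same samples with list concatenation.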
company_counts = (
df.query("country_name in @top3")
.groupby("country_name")["company"]
.value_counts()
.groupby("country_name")
.head(50)
.reset_index()
)
company_counts
| country_name | company | count | |
|---|---|---|---|
| 0 | Canada | CareerBeacon | 1355 |
| 1 | Canada | ClickJobs.io | 1197 |
| 2 | Canada | Canadian Armed Forces | Forces armées canadiennes | 812 |
| 3 | Canada | Stantec | 719 |
| 4 | Canada | Vancouver Coastal Health | 663 |
| ... | ... | ... | ... |
| 145 | United States | WIS International | 1839 |
| 146 | United States | The Home Depot | 1789 |
| 147 | United States | Northrop Grumman | 1746 |
| 148 | United States | Trinity Health | 1744 |
| 149 | United States | Ross Stores, Inc. | 1734 |
150 rows × 3 columns
fig = make_subplots(3, 1)
for i, country in enumerate(["United States", "Canada", "United Kingdom"]):
fig.add_trace(
go.Bar(
text=company_counts.query("country_name == @country")["company"],
y=company_counts.query("country_name == @country")["count"],
name=country,
),
row=i + 1,
col=1,
)
fig.update_layout(
height=2000, title="Top 50 Company Job Distribution for each of the Top 3 Countries"
)
fig.update_xaxes(title="Nth Largest Company")
fig.update_yaxes(title="Number of Jobs")
fig
Scipy's 2 sample KS test needs a sample, not a PMF. So, we just create that for each country and print out the statistics for each sample.
# Gets samples from each of the countries
pmfs = {}
for country in top3:
data = []
counts = company_counts.query("country_name == @country")["count"].reset_index()[
"count"
]
for i, x in enumerate(counts):
data += [i] * x
pmfs[country] = data
sum_stats = pd.DataFrame({country: data}).describe()
sum_stats.loc["skew"] = stats.skew(data)
display(sum_stats)
| United States | |
|---|---|
| count | 259497.000000 |
| mean | 13.188993 |
| std | 13.685892 |
| min | 0.000000 |
| 25% | 2.000000 |
| 50% | 8.000000 |
| 75% | 22.000000 |
| max | 49.000000 |
| skew | 0.987267 |
| Canada | |
|---|---|
| count | 13950.000000 |
| mean | 14.430323 |
| std | 13.962532 |
| min | 0.000000 |
| 25% | 3.000000 |
| 50% | 10.000000 |
| 75% | 24.000000 |
| max | 49.000000 |
| skew | 0.891284 |
| United Kingdom | |
|---|---|
| count | 23149.000000 |
| mean | 14.552162 |
| std | 14.149398 |
| min | 0.000000 |
| 25% | 2.000000 |
| 50% | 10.000000 |
| 75% | 24.000000 |
| max | 49.000000 |
| skew | 0.848315 |
for c1, c2 in itertools.combinations(top3, 2):
print("Comparing distributions of:", c1, ",", c2)
print(stats.ks_2samp(pmfs[c1], pmfs[c2]))
Comparing distributions of: United States , Canada
KstestResult(statistic=0.061611845347954136, pvalue=3.9712986155353705e-44, statistic_location=2, statistic_sign=1)
Comparing distributions of: United States , United Kingdom
KstestResult(statistic=0.054722840970027664, pvalue=9.298692564618634e-56, statistic_location=3, statistic_sign=1)
Comparing distributions of: Canada , United Kingdom
KstestResult(statistic=0.05345704181311936, pvalue=4.628346419662773e-22, statistic_location=0, statistic_sign=-1)
All null hypotheses have been rejected ($p<<\alpha$), meaning that the PMFs are statistically different for each country. According to the statistics and graph above, Canada has the flatter distribution while the U.S. and U.K. have a stronger peak at 0, meaning their postings are concentrated in larger companies.
Conclusions of Exploratory Data Analysis¶
- The vast majority of job postings on LinkedIn are located in the U.S., Canada, U.K., and Australia. We chose the top 3 to do hypothesis testing on.
- Jobs are heavily concentrated in major cities, as seen by the geoscatter plot.
- We have shown with statistical significance that the proportion of mid-senior jobs between the three countries (U.S., U.K., Canada) is as follows in ascending order: Canada, United States, United Kingdom. We also followed up with Post-Hoc testing to confirm this.
- We have shown with statistical significance that the frequency distributions of job type (on-site, remote, hybrid) differ from country to country.
- We have shown with statistical significance that the distributions of job concentration among the top 50 companies in each country differ. From the summary statistics and plots, we can determine that Canada's top 50 companies are more equal in size, whereas the U.S. and U.K. have postings concentrated more heavily toward their very largest companies.
4. Primary Analysis¶
Now it's finally time to look at the text fields (just the job skills; we ignore the summaries because analyzing them is too computationally heavy). Since we have a list of job skills for each posting, we want to group postings together into, perhaps, different industries or fields. While the job_title column gives some indication of this, it is VERY unformatted, so supervised learning is out the window. Instead, we do unsupervised learning using K-Means clustering. To cluster the different skill sentences, we first need to vectorize them, so we use SentenceTransformers with the all-MiniLM-L6-v2 model to turn each sentence into a 384-element vector.
transformer = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=".")
This code took about 20 minutes on a GTX 1080, so we save the encodings the first time they're generated and only compute them once. We would not recommend running this on a CPU... As you can see, for every one of the ~1.2 million postings, we have a 384-length vector associated with the skills for that posting.
if os.path.isfile("skills_encodings.npz"):
encodings = np.load("skills_encodings.npz")["skills"]
else:
encodings = transformer.encode(df.job_skills)
np.savez_compressed("skills_encodings.npz", skills=encodings)
print(encodings.shape)
(1224559, 384)
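What these 384-dimensional vectors provide is a geometric notion of similarity: postings with related skills end up close together, typically measured by cosine similarity. A pure-NumPy sketch of that comparison (random vectors stand in for real embeddings here):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # stand-in for one skill-list embedding
b = rng.normal(size=384)  # stand-in for another

def cosine(u, v):
    # Cosine similarity: dot product of the normalized vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, a))  # identical vectors: similarity 1.0
print(cosine(a, b))  # unrelated random vectors: similarity near 0
```

K-Means operates on Euclidean distance over these vectors, which behaves similarly for embeddings of comparable norm.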
To figure out the number of clusters for the K-Means clustering, we use both the elbow method and the silhouette score.
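Both diagnostics can be sketched on synthetic data where the ground truth is known; here `make_blobs` generates 4 well-separated groups (the sample size and candidate cluster counts are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=123)

wcss, sil = [], []
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X)
    wcss.append(km.inertia_)                     # WCSS always drops as k grows
    sil.append(silhouette_score(X, km.labels_))  # quality of the separation
print([round(w) for w in wcss])
print([round(s, 3) for s in sil])
```

Because WCSS decreases monotonically, the elbow method looks for where the decrease flattens, while the silhouette score can be compared directly across values of k.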
del cities, length_of_cities, m
silhouette = []
wcss = []
clusters = [2, 3, 4, 5, 7, 9, 11, 13, 15]
for i in clusters:
model = cluster.KMeans(
n_clusters=i, max_iter=100, init="k-means++", random_state=123
).fit(encodings)
silhouette.append(
metrics.silhouette_score(encodings, model.labels_, sample_size=1000)
)
wcss.append(model.inertia_)
print(f"{i} clusters done")
silhouette_data = pd.DataFrame(
{"Number of Clusters": clusters, "Silhouette Coefficient": silhouette}
)
wcss_data = pd.DataFrame({"Number of Clusters": clusters, "WCSS": wcss})
px.line(silhouette_data, x="Number of Clusters", y="Silhouette Coefficient").show()
px.line(wcss_data, x="Number of Clusters", y="WCSS").show()
2 clusters done
3 clusters done
4 clusters done
5 clusters done
7 clusters done
9 clusters done
11 clusters done
13 clusters done
15 clusters done
The elbow method alone would have been inconclusive. However, combined with the silhouette score, we can determine that 7 is a good number of clusters, so that's what we'll use. A higher silhouette score is desirable, but 2-4 clusters would be too coarse for this many groups of jobs, and our goal is to get a distinct cluster for tech jobs.
num_clusters = 7
labels = cluster.KMeans(
n_clusters=num_clusters,
max_iter=100,
init="k-means++",
random_state=123,
).fit_predict(encodings)
5. Visualization¶
Now we want to see the sentences in the different groups to view the clusters that the model formed, and what they could mean.
clustered_sentences = [[] for _ in range(num_clusters)]
for i in range(len(labels)):
clustered_sentences[labels[i]].append(df.job_skills[i])
We print a Word Cloud, which shows the most frequent phrases or words as large text. We create one for every cluster so that we can see what each cluster indicates.
from collections import Counter
stop_words = [
"to",
"in",
"and",
"of",
"a",
"s",
"skills",
"management",
"with",
]
for i in range(num_clusters):
word_counter = Counter(" ".join(clustered_sentences[i]).split(" "))
for stop_word in stop_words:
if stop_word in word_counter:
del word_counter[stop_word]
fig = px.imshow(WordCloud(width=1920, height=1080, random_state=123).generate_from_frequencies(word_counter))
fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0)).show()
It seemed to work pretty well. We can describe each cluster relatively well.
- Cluster 0: Therapist / Counselor
- Cluster 1: Business Analytics
- Cluster 2: General Inventory / Restaurant Jobs
- Cluster 3: Computer Science Jobs
- Cluster 4: Medical Jobs
- Cluster 5: Business Management
- Cluster 6: Retail jobs
We can now assign every job a predicted label, and store this relationship.
clusters = {
0: "Therapist / Counselor",
1: "Business Analytics",
2: "General Inventory / Restaurant Jobs",
3: "Computer Science Jobs",
4: "Medical Jobs",
5: "Business Management",
6: "Retail jobs",
}
df["labels"] = labels.copy()
As a quick check, we preview the job_title column for some of the jobs in the tech cluster. It appears to have worked, since the job titles look like computing or mostly STEM-related positions!
df.query("labels == 3")["job_title"][:20]
19     Test Conductor - Space Systems
34     Sr Experience Design Manager, Learn and Help
47     Designer I
51     CIBC | Analyste Principal, Intégration des cli...
84     Senior Network Engineering Lead - Remote
98     Lead Software Engineer, Back End
99     Founding Fullstack Engineer, AI Startup
101    Design Strategy Manager
131    RESEARCH ANALYST II
141    Android Engineer
170    Ad Operations Manager (Hulu)
171    Senior Software Engineer, DevOps
173    Senior Machine Learning Engineer
198    Contracts Lifecycle Management (CLM) Solutions...
211    Senior Specialist Programmer
225    IT / Business Framework Specialist
242    R&D Scientist
243    Distinguished Applied Researcher
270    Senior Flutter Developer
278    Principal Software Engineer, ML Accelerators
Name: job_title, dtype: object
Now we can also try topic modeling, since it may be a better method here, and we first free up some memory. Note that this is sort of a combination of steps 4 and 5 (primary analysis and visualization), but that's fine.
del encodings, labels, word_counter
cv = feature_extraction.text.CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=0.01, max_df=0.5)
encodings = cv.fit_transform(df["job_skills"])
encodings.shape
(1224559, 269)
This also takes a long time. Set the n_jobs parameter to just under the number of logical cores you have to speed it up a LOT. For reference, it took almost 8 minutes with n_jobs=14 on a Ryzen 7 1700 (an 8-core, 16-thread CPU).
labels = decomposition.LatentDirichletAllocation(n_components=7, random_state=123, n_jobs=14).fit_transform(encodings)
labels.shape
(1224559, 7)
Now we do the same grouping and word cloud as before
clustered_sentences = [[] for _ in range(num_clusters)]
for i in range(len(labels)):
clustered_sentences[labels[i].argmax()].append(df.job_skills[i])
from collections import Counter
# Exclude these words
stop_words = [
"to",
"in",
"and",
"of",
"a",
"s",
"skills",
"management",
"with",
]
for i in range(num_clusters):
word_counter = Counter(" ".join(clustered_sentences[i]).split(" "))
for stop_word in stop_words:
if stop_word in word_counter:
del word_counter[stop_word]
fig = px.imshow(WordCloud(width=1920, height=1080, random_state=123).generate_from_frequencies(word_counter))
fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0)).show()
del encodings, labels, word_counter # Clear memory again
As you can see, there is somewhat of a cluster for computing jobs (cluster 2), but it is not as pronounced as with the K-Means method. Therefore, we'll stick with the K-Means labels, which we've already stored in the dataframe.
6. Insights and Conclusions¶
We can now use these labels to analyze the software jobs. But before that, let's take a look at the distribution of jobs between the clusters.
fig = df["labels"].value_counts().rename(index=clusters).plot.bar()
fig.update_layout(
title="Count of Jobs per Cluster", xaxis_title="Clusters", yaxis_title="Count"
)
So we have a solid 92k CS jobs to look at. Now we can run hypothesis tests on these between countries, similar to what we did earlier.
fig = (
df.query("labels == 3 and country_name in @top3")["country_name"]
.value_counts()
.plot.bar()
)
fig.update_layout(
title="CS Jobs by Country", xaxis_title="Top 3 Countries", yaxis_title="Count"
)
To do the hypothesis test, we want to look at the proportion of CS jobs in each of the top 3 countries and compare those using a $\chi^2$ test.
Here, the "True" column is CS jobs, while "False" is all non-CS jobs.
contingency_table = pd.crosstab(
df.query("country_name in @top3")["country_name"],
df.query("country_name in @top3")["labels"] == 3,
)
display(contingency_table)
display((contingency_table.T / contingency_table.sum(axis=1)).T)
display(
contingency_table.plot(
kind="bar", title="Frequency", backend="plotly", barmode="group"
)
)
display(
((contingency_table.T / contingency_table.sum(axis=1)).T).plot(
kind="bar", title="Probability distributions", backend="plotly", barmode="group"
)
)
stats.chi2_contingency(contingency_table)
| labels | False | True |
|---|---|---|
| country_name | ||
| Canada | 41674 | 4278 |
| United Kingdom | 85089 | 6619 |
| United States | 982405 | 79149 |
| labels | False | True |
|---|---|---|
| country_name | ||
| Canada | 0.906903 | 0.093097 |
| United Kingdom | 0.927825 | 0.072175 |
| United States | 0.925440 | 0.074560 |
Chi2ContingencyResult(statistic=230.07197381576603, pvalue=1.0977556110093208e-50, dof=2, expected_freq=array([[ 42501.57848057, 3450.42151943],
[ 84821.87411421, 6886.12588579],
[981844.54740522, 79709.45259478]]))
At this sample size, it's not surprising that the p-value is almost 0. The test also shows that Canada has the highest proportion of CS jobs on LinkedIn, with the U.S. second and the U.K. third.
Final Conclusions¶
- On LinkedIn, there are many more jobs in the U.S. than other countries.
- There is a higher proportion of senior jobs in the U.K. compared to the U.S. and Canada
- There really aren't that many remote or hybrid jobs on LinkedIn
- Computer Science jobs make up the highest proportion of postings in Canada, with the U.S. next and the U.K. last
- There are MANY medical job postings on LinkedIn, if that interests you
I hope that these insights will help you on your job search. If the above limitations affect you, we recommend also trying other websites (Indeed, etc.).